GOM-Hadoop: A distributed framework for efficient analytics on ordered datasets

Authors

Jiangtao Yin

Yong Liao

Mario Baldi

Lixin Gao

Antonio Nucci

Abstract

Event log files are among the datasets that corporations most commonly use for business intelligence analysis. The records in event log files are often temporally ordered and need to be grouped by a certain key, with the temporal ordering preserved, to facilitate further analysis. One such example is grouping temporally ordered events by user ID in order to analyze user behavior. This kind of analytical workload, here referred to as RElative Order-pReserving based Grouping (Re-Org), is quite common in big data analytics, where the MapReduce programming paradigm (and its open-source implementation, Hadoop) is widely adopted for massively parallel processing. However, using MapReduce/Hadoop to execute Re-Org tasks on ordered datasets is inefficient because of its internal sort–merge mechanism for shuffling data from mappers to reducers. In this paper, we propose a distributed framework that adopts an efficient group-order-merge mechanism to speed up the execution of Re-Org tasks. We demonstrate the advantage of our framework by formally modeling its execution process and by comparing its performance with Hadoop through extensive experiments on real-world datasets. The evaluation results show that our framework can achieve up to 6.3x speedup over Hadoop in executing Re-Org tasks. © 2015 Elsevier Inc. All rights reserved.
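To make the Re-Org workload concrete, the following is a minimal sketch of order-preserving grouping on a single machine. It illustrates only the semantics of the task described in the abstract (group by key, keep each group's records in their input order); it is not the paper's group-order-merge mechanism, and the function name `reorg_group` and the sample events are hypothetical.

```python
def reorg_group(events, key):
    """Group events by key(event), keeping each group's events in input order.

    Assumption: `events` is already temporally ordered, so a single pass
    that appends to per-key lists preserves the relative order without
    any sorting step.
    """
    groups = {}
    for event in events:
        groups.setdefault(key(event), []).append(event)
    return groups


# Hypothetical event log: (timestamp, user_id, action), ordered by timestamp.
events = [
    (1, "alice", "login"),
    (2, "bob", "login"),
    (3, "alice", "search"),
    (4, "bob", "logout"),
    (5, "alice", "logout"),
]

grouped = reorg_group(events, key=lambda e: e[1])
for user, user_events in grouped.items():
    print(user, user_events)
```

Because the input is already ordered, a single appending pass suffices here; a general-purpose sort of all records by key and timestamp would redo ordering work the input already provides.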

Similar resources

A Study of Adverse Drug Reactions in Paediatric FAERS

The emergence of massive datasets in FAERS presents both challenges and opportunities in data analysis. This so-called “big data” poses challenges and will increasingly require novel solutions customized from related domains. Advances in information and communication technology provide the most feasible solutions to big data analysis in terms of efficiency and scalability. The MapReduce programm...

Redoop: Supporting Recurring Queries in Hadoop

The growing demand for large-scale data analytics ranging from online advertisement placement, log processing, to fraud detection, has led to the design of highly scalable data-intensive computing infrastructures such as the Hadoop platform. Recurring queries, repeatedly being executed for long periods of time on rapidly evolving high-volume data, have become a bedrock component in most of thes...

Hone: "Scaling Down" Hadoop on Shared-Memory Systems

The underlying assumption behind Hadoop and, more generally, the need for distributed processing is that the data to be analyzed cannot be held in memory on a single machine. Today, this assumption needs to be re-evaluated. Although petabyte-scale datastores are increasingly common, it is unclear whether “typical” analytics tasks require more than a single high-end server. Additionally, we are ...

Optimization Techniques for "Scaling Down" Hadoop on Multi-Core, Shared-Memory Systems

Spatio-Temporal Big Data Analytics for Environmental Health

The framework for our proposed big data analytics platform is shown in Figure 1. Two complementary systems support the wide variety of spatial analytics algorithms and techniques we are providing. On the left half of Figure 1, the more traditional Unix filesystem supports high-throughput computation (e.g., MPI [Snir et al., 1995], OpenMP [Dagum and Menon, 1998], GPGPU/CUDA [Luebke et al., 2006])...

Journal:

J. Parallel Distrib. Comput.

Volume 83, issue not listed

Pages: not listed

Publication date: 2015
